I use functions provided by the creator of this Kaggle dataset: COVID-19 dataset.
It uses the data provided by https://www.worldometers.info/ and https://github.com/CSSEGISandData/COVID-19.

Import datasets

Worldometer data

Country/Region Continent Population TotalCases NewCases TotalDeaths NewDeaths TotalRecovered NewRecovered ActiveCases Serious,Critical Tot Cases/1M pop Deaths/1M pop TotalTests Tests/1M pop WHO Region
0 USA North America 3.314099e+08 6788147 0.0 200197.0 0.0 4068086.0 0.0 2519864.0 14165.0 20483.0 604.0 93631958.0 282526.0 Americas
1 India Asia 1.382827e+09 5020359 2325.0 82091.0 0.0 3942360.0 3249.0 995908.0 8944.0 3631.0 59.0 59429115.0 42977.0 South-EastAsia
2 Brazil South America 2.128756e+08 4384299 0.0 133207.0 0.0 3671128.0 0.0 579964.0 8318.0 20596.0 626.0 14505652.0 68141.0 Americas
3 Russia Europe 1.459477e+08 1073849 0.0 18785.0 0.0 884305.0 0.0 170759.0 2300.0 7358.0 129.0 41122307.0 281760.0 Europe
4 Peru South America 3.306653e+07 738020 0.0 30927.0 0.0 580753.0 0.0 126340.0 1460.0 22319.0 935.0 3567927.0 107901.0 Americas
array(['Americas', 'South-East Asia', 'Europe', 'Africa',
       'Eastern Mediterranean', 'Western Pacific'], dtype=object)
(183, 16)

Some countries have no WHO region assigned. I fill these in from the other dataframes and remove the countries that still have no region afterwards.
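As a rough illustration of this step, here is a minimal pandas sketch. The dataframe names (`worldometer`, `full_grouped`) and the rows are assumptions made for the example, and "Unmatched" is a made-up placeholder country; only the column names follow the tables in this notebook.

```python
import pandas as pd

# Hypothetical miniature dataframes; rows are illustrative only.
worldometer = pd.DataFrame({
    "Country/Region": ["USA", "Poland", "Unmatched"],
    "WHO Region": ["Americas", None, None],
})
full_grouped = pd.DataFrame({
    "Country/Region": ["USA", "Poland"],
    "WHO Region": ["Americas", "Europe"],
})

# One region per country, taken from the dataframe with complete labels.
region_by_country = (
    full_grouped.drop_duplicates("Country/Region")
    .set_index("Country/Region")["WHO Region"]
)

# Fill only the missing labels, then drop countries that still have none.
worldometer["WHO Region"] = worldometer["WHO Region"].fillna(
    worldometer["Country/Region"].map(region_by_country)
)
worldometer = worldometer.dropna(subset=["WHO Region"]).reset_index(drop=True)
```

Filling with `fillna` rather than overwriting keeps any region label that was already present in the Worldometer data.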

Population TotalCases NewCases TotalDeaths NewDeaths TotalRecovered NewRecovered ActiveCases Serious,Critical Tot Cases/1M pop Deaths/1M pop TotalTests Tests/1M pop
count 1.830000e+02 1.830000e+02 183.000000 183.000000 183.000000 1.830000e+02 183.000000 1.830000e+02 183.000000 183.000000 183.000000 1.830000e+02 1.830000e+02
mean 4.242424e+07 1.622741e+05 66.284153 5130.158470 3.874317 1.164053e+05 73.497268 3.491497e+04 332.442623 4490.808743 116.513005 3.165278e+06 1.132242e+05
std 1.521311e+08 7.080593e+05 434.596385 20117.428324 46.635959 5.047008e+05 487.837918 2.052767e+05 1443.124745 6528.498579 199.054827 1.470605e+07 2.237939e+05
min 3.394600e+04 1.400000e+01 0.000000 0.000000 0.000000 0.000000e+00 0.000000 0.000000e+00 0.000000 3.000000 0.000000 0.000000e+00 0.000000e+00
25% 2.760903e+06 2.458500e+03 0.000000 41.000000 0.000000 1.233500e+03 0.000000 3.110000e+02 0.000000 431.500000 7.500000 4.734950e+04 9.961500e+03
50% 9.915136e+06 1.040100e+04 0.000000 214.000000 0.000000 6.839000e+03 0.000000 1.835000e+03 5.000000 1942.000000 34.000000 2.392040e+05 4.228300e+04
75% 3.131727e+07 7.158600e+04 0.000000 1286.500000 0.000000 4.439600e+04 0.000000 9.910000e+03 69.000000 5619.500000 124.000000 1.320292e+06 1.232965e+05
max 1.439324e+09 6.788147e+06 4771.000000 200197.000000 629.000000 4.068086e+06 5273.000000 2.519864e+06 14165.000000 43527.000000 1237.000000 1.600000e+08 1.778435e+06

We can see that for some countries we have no data about the tests, resulting in TotalTests=0.
There are 183 countries taken into account.

This check came out of the Sankey plot. After normalising within each region (otherwise the Americas would dwarf the others), I realised that Active, Deaths and Recoveries do not add up to Total for Europe. Therefore, I checked which countries have cases unaccounted for.

Country/Region Continent Population TotalCases NewCases TotalDeaths NewDeaths TotalRecovered NewRecovered ActiveCases Serious,Critical Tot Cases/1M pop Deaths/1M pop TotalTests Tests/1M pop WHO Region difference
8 Spain Europe 46758621.0 603167 0.0 30004.0 0.0 0.0 0.0 0.0 1157.0 12900.0 642.0 10756835.0 230050.0 Europe 573163.0
13 UK Europe 67960923.0 374228 0.0 41664.0 0.0 0.0 0.0 0.0 106.0 5507.0 613.0 20292025.0 298584.0 Europe 332564.0
38 Sweden Europe 10112466.0 87345 0.0 5851.0 0.0 0.0 0.0 0.0 17.0 8637.0 579.0 1250488.0 123658.0 Europe 81494.0
39 Netherlands Europe 17142887.0 84778 0.0 6258.0 0.0 0.0 0.0 0.0 74.0 4945.0 365.0 1832451.0 106893.0 Europe 78520.0
1065741.0

This is really concerning. Over 1M cases have gone missing somewhere along the way.
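The check above can be sketched as follows. The four European rows are copied from the output shown above and Peru is added as a row that does add up; the variable names are assumptions for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "Country/Region": ["Spain", "UK", "Sweden", "Netherlands", "Peru"],
    "TotalCases": [603167, 374228, 87345, 84778, 738020],
    "TotalDeaths": [30004, 41664, 5851, 6258, 30927],
    "TotalRecovered": [0, 0, 0, 0, 580753],
    "ActiveCases": [0, 0, 0, 0, 126340],
})

# Cases not accounted for by any of the three outcome columns.
df["difference"] = df["TotalCases"] - (
    df["TotalDeaths"] + df["TotalRecovered"] + df["ActiveCases"]
)
missing = df[df["difference"] != 0]
total_missing = missing["difference"].sum()
print(missing[["Country/Region", "difference"]])
print(total_missing)
```

For these rows the differences sum to 1065741, which matches the total printed above.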

Day wise - whole world

Date Confirmed Deaths Recovered Active New cases New deaths New recovered Deaths / 100 Cases Recovered / 100 Cases Deaths / 100 Recovered No. of countries
0 2020-01-22 555 17 28 510 0 0 0 3.06 5.05 60.71 6
1 2020-01-23 654 18 30 606 99 1 2 2.75 4.59 60.00 8
2 2020-01-24 941 26 35 880 287 8 5 2.76 3.72 74.29 9
3 2020-01-25 1434 42 38 1354 493 16 3 2.93 2.65 110.53 11
4 2020-01-26 2118 56 51 2011 684 14 13 2.64 2.41 109.80 13
(238, 12)
Confirmed Deaths Recovered Active New cases New deaths New recovered Deaths / 100 Cases Recovered / 100 Cases Deaths / 100 Recovered No. of countries
count 2.380000e+02 238.000000 2.380000e+02 2.380000e+02 238.000000 238.000000 238.000000 238.000000 238.000000 238.000000 238.000000
mean 8.331178e+06 349917.239496 4.111361e+06 3.869900e+06 124373.390756 3928.373950 75798.159664 4.581387 34.688529 21.988025 153.357143
std 8.885410e+06 303814.499677 5.251305e+06 3.449274e+06 98848.248522 2491.870812 78764.713278 1.529753 17.252679 22.668686 60.466842
min 5.550000e+02 17.000000 2.800000e+01 5.100000e+02 0.000000 0.000000 0.000000 2.040000 1.650000 5.180000 6.000000
25% 3.155050e+05 13590.250000 8.757075e+04 2.143440e+05 32327.500000 1527.500000 3464.000000 3.422500 19.872500 7.627500 165.250000
50% 4.953276e+06 328526.000000 1.382005e+06 3.242746e+06 96771.500000 4583.000000 44751.000000 4.205000 36.500000 13.705000 187.000000
75% 1.422989e+07 597187.750000 6.817994e+06 6.814712e+06 216018.500000 5782.250000 144440.250000 5.897500 49.910000 29.022500 187.000000
max 2.957057e+07 934970.000000 1.803999e+07 1.059561e+07 380492.000000 10134.000000 293068.000000 7.240000 61.010000 141.380000 187.000000

Day wise - countries separately

Date Country/Region Confirmed Deaths Recovered Active New cases New deaths New recovered WHO Region
0 2020-01-22 Afghanistan 0 0 0 0 0 0 0 Eastern Mediterranean
1 2020-01-22 Albania 0 0 0 0 0 0 0 Europe
2 2020-01-22 Algeria 0 0 0 0 0 0 0 Africa
3 2020-01-22 Andorra 0 0 0 0 0 0 0 Europe
4 2020-01-22 Angola 0 0 0 0 0 0 0 Africa
(44506, 10)

I checked the min and max values in the dataframe. It turned out that some values were negative. Since negative counts make no sense, I decided to treat them as typos and keep their absolute values.
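A minimal sketch of that fix, assuming the counts sit in ordinary numeric columns. The rows below are made up purely to show a negative entry.

```python
import pandas as pd

# Illustrative rows; the negative entries are invented for the example.
df = pd.DataFrame({
    "Country/Region": ["A", "B"],
    "New cases": [12, -3],
    "New recovered": [-5, 7],
})

numeric_cols = df.select_dtypes("number").columns
print(df[numeric_cols].min())  # a negative minimum reveals the problem

# Treat negative counts as sign typos and keep their absolute values.
df[numeric_cols] = df[numeric_cols].abs()
```

Taking absolute values is only one possible repair; clipping at zero would be the alternative if the negatives were corrections rather than typos.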

Confirmed Deaths Recovered Active New cases New deaths New recovered
count 4.450600e+04 44506.000000 4.450600e+04 4.450600e+04 44506.000000 44506.000000 44506.000000
mean 4.455175e+04 1871.215184 2.198589e+04 2.069465e+04 665.098346 21.139419 407.847391
std 2.866340e+05 10332.427901 1.587160e+05 1.506736e+05 4258.557230 123.390527 3301.103067
min 0.000000e+00 0.000000 0.000000e+00 0.000000e+00 0.000000 0.000000 0.000000
25% 1.200000e+01 0.000000 0.000000e+00 3.000000e+00 0.000000 0.000000 0.000000
50% 7.290000e+02 11.000000 7.100000e+01 2.380000e+02 6.000000 0.000000 0.000000
75% 8.534750e+03 178.000000 2.000000e+03 3.080000e+03 129.000000 2.000000 27.000000
max 6.605733e+06 195915.000000 3.942360e+06 3.914691e+06 173932.000000 4143.000000 162253.000000
array(['Eastern Mediterranean', 'Europe', 'Africa', 'Americas',
       'Western Pacific', 'South-East Asia'], dtype=object)
(187,)

I build a dataframe that maps each country to its WHO Region and use it to fill the gaps in the Worldometer data.

Country/Region WHO Region
0 Afghanistan Eastern Mediterranean
1 Albania Europe
2 Algeria Africa
3 Andorra Europe
4 Angola Africa

I want uniform color coding throughout this file. Therefore, I sort this dataframe by the Worldometer region order.

I check where the total number of cases is not what it should be, that is, where Confirmed differs from the sum of Deaths, Recovered and Active.
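One way to impose a fixed region order is an ordered categorical, sketched below. The region order is the one printed for the Worldometer data earlier, and the rows are the first rows of the country-to-region table; the variable names are assumptions.

```python
import pandas as pd

region_order = ["Americas", "South-East Asia", "Europe", "Africa",
                "Eastern Mediterranean", "Western Pacific"]

regions = pd.DataFrame({
    "Country/Region": ["Afghanistan", "Albania", "Algeria", "Andorra", "Angola"],
    "WHO Region": ["Eastern Mediterranean", "Europe", "Africa", "Europe", "Africa"],
})

# Sorting a categorical column follows the category order, not the alphabet.
regions["WHO Region"] = pd.Categorical(
    regions["WHO Region"], categories=region_order, ordered=True
)
regions = regions.sort_values(["WHO Region", "Country/Region"]).reset_index(drop=True)
```

With the order fixed this way, the n-th color of a palette can always be paired with the same region.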

WHO Region Date Country/Region Confirmed Deaths Recovered Active New cases New deaths New recovered
19308 Europe 2020-06-23 Liechtenstein 82 2 81 1 0 1 0
32722 Africa 2020-07-20 Uganda 1069 0 1071 2 4 0 48

Color palettes

I use the divik package to get a consistent color theme for all palettes, regardless of their length.

Regions

This palette provides colors for:

array(['Americas', 'South-East Asia', 'Europe', 'Africa',
       'Eastern Mediterranean', 'Western Pacific'], dtype=object)

Months

I make a palette for one calendar year. As the pandemic continues, this will have to change to a "Month 1", "Month 2", ... form, in the dataframes as well.

Spotlight countries

I chose several closely watched countries, plus Poland for comparison. They appear later on.

Happiness report group

For comparison, I chose the group that resulted from hierarchical clustering in the Happiness Report notebook. The group appears later on.

Total cases structure

This palette is for observing the structure of cases so far for a specified region: "Active", "Recovered" or "Death". It is not built from exactly 3 elements, because the colors generated for 3 are not as diverging as I wanted.

Cases structure

This palette is for the cases structure that also splits "Active" into "Mild" and "Critical". I do not use it in place of the previous one, as the relevant df has no date column.

Testing

Palette for discretized tests/1M people.

Epidemics

Palettes for the epidemics comparison part.

Violin plots

I use functions developed during previous visualizations.

discretized test/1M and death/confirmed

I want to check if the mortality ratio structures are significantly different for specified testing groups.

to 100k          81
to 1M            53
to 10k           28
None reported    17
to 10M            3
to 1k             1
Name: Tested/1M, dtype: int64
Country/Region Continent Population Tests/1M pop
102 Luxembourg Europe 628066.0 1191195.0
150 Andorra Europe 77291.0 1778435.0
169 Monaco Europe 39300.0 1321959.0
Country/Region Continent Population TotalCases NewCases TotalDeaths NewDeaths TotalRecovered NewRecovered ActiveCases Serious,Critical Tot Cases/1M pop Deaths/1M pop TotalTests Tests/1M pop WHO Region Deaths/1kCases Tested/1M
153 Niger Africa 24377782.0 1182 0.0 69.0 0.0 1104.0 0.0 9.0 0.0 48.0 3.0 19543.0 802.0 Africa 58.375635 to 1k

First of all, we can notice that at this point some countries have run more tests than they have inhabitants. They obviously have the best ratio.
Unsurprisingly, the countries with no recorded tests have higher mortality.
Only one country reports fewer than 1k tests per 1M people (among those that report at all), so it would not appear on the plots. I add it to the "to 10k" group. However, the "to 1M" group is moved to the penultimate place.
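The discretization and the merge of the single "to 1k" country can be sketched with `pd.cut`. The bin edges are an assumption inferred from the group labels, the first four values are copied from the tables above, and the "No data" row is illustrative.

```python
import numpy as np
import pandas as pd

tests = pd.DataFrame({
    "Country/Region": ["USA", "India", "Niger", "Luxembourg", "No data"],
    "Tests/1M pop": [282526.0, 42977.0, 802.0, 1191195.0, 0.0],
})

# Assumed edges: each label is the upper bound of its bin; 0 means no tests reported.
bins = [-np.inf, 0, 1e3, 1e4, 1e5, 1e6, 1e7]
labels = ["None reported", "to 1k", "to 10k", "to 100k", "to 1M", "to 10M"]
tests["Tested/1M"] = pd.cut(tests["Tests/1M pop"], bins=bins, labels=labels)

# Fold the single "to 1k" country into "to 10k" so it is not lost on the plots.
tests["Tested/1M"] = tests["Tested/1M"].astype(str).replace({"to 1k": "to 10k"})
```

Because `pd.cut` closes bins on the right by default, a value of exactly 0 falls into "None reported".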

Death/cases for regions by each month

  • January: The virus is already in 3 regions. However, the deaths occur only in China.
  • February: South-East Asia is replaced with Americas and Eastern Mediterranean. The first deaths occur in all these regions and become the outliers in each group.
  • March: all the regions appear on the plot. All the fatal case ratios are close to 0 except for the outlier points. Europe and Americas move to the top.
  • April: the overall number of countries still increases. Western Pacific moves down in position.
  • May: Europe goes up.
  • June: the ratio for outliers is much lower than in previous months.
  • July: Besides Yemen, the situation seems to be improving.
  • August: There is again a small-country outlier.

Death/cases for months by each region

  • Americas: the first cases occurred in February and the virus hasn't let go ever since. Mexico is often an upper outlier. The ratio never exceeds 0.4 for non-minuscule countries.
  • South-East Asia: the pandemic started in January with no fatal cases and withdrew in February (fewer than 3 countries infected). The ratio is the lowest across the regions, 0.1 at most. The only outlier is Indonesia. The highest ratio seems to be going down.
  • Europe: the virus has been present since January (with no deaths at that time) without breaks. The ratio seemed to be going up until May, when it started decreasing. There are always several outliers.
  • Africa: the pandemic started only in March, already with fatal cases.
  • Eastern Mediterranean: present since February, with one country reporting a fatal case.
  • Western Pacific: starting with China as the outlier.

Get sums for regions

Here I gather the sums of Tests and Cases for each region and normalize them accordingly. Because of the previous observations, I treat Inactive cases as the difference between Total and Active, so that the parts add up to the total. However, Recoveries and Deaths will not sum up to the number of Inactive cases.
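A sketch of the regional sums, assuming that "normalize" here means expressing each part as a share of the region's total. The rows are copied from the Worldometer tables above (two countries per region); the variable names are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "WHO Region": ["Americas", "Americas", "Europe", "Europe"],
    "Country/Region": ["USA", "Brazil", "Spain", "UK"],
    "TotalCases": [6788147, 4384299, 603167, 374228],
    "ActiveCases": [2519864, 579964, 0, 0],
})

sums = df.groupby("WHO Region")[["TotalCases", "ActiveCases"]].sum()

# Inactive is defined as Total - Active, so the two parts always add up.
sums["InactiveCases"] = sums["TotalCases"] - sums["ActiveCases"]

# Shares within each region, so a large region does not dominate the plot.
shares = sums[["ActiveCases", "InactiveCases"]].div(sums["TotalCases"], axis=0)
```

Defining Inactive by subtraction sidesteps the countries that report zero recoveries, at the cost that Deaths plus Recoveries no longer has to equal Inactive.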

Days from the beginning of the pandemic

I change the form of the dataframes to cover only the days since the first case in the specified group.
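A minimal sketch of this reshaping, assuming one row per group and date with a cumulative `Confirmed` column. The numbers are made up; the `Day` column name is an assumption.

```python
import pandas as pd

# Illustrative daily totals for two groups.
df = pd.DataFrame({
    "WHO Region": ["Europe"] * 4 + ["Africa"] * 4,
    "Date": pd.to_datetime(["2020-01-22", "2020-01-23", "2020-01-24", "2020-01-25"] * 2),
    "Confirmed": [0, 2, 5, 9, 0, 0, 0, 1],
})

# Keep only rows from the first confirmed case onwards in each group.
started = df[df["Confirmed"] > 0].copy()

# Day 0 is the date of the first confirmed case in that group.
first_day = started.groupby("WHO Region")["Date"].transform("min")
started["Day"] = (started["Date"] - first_day).dt.days
```

After this step the groups share a common time axis, so curves that started on different calendar dates can be overlaid.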

Regions

I start with the regions - this way it will be possible to investigate the course of the pandemic in each WHO Region specified.

Normalized

I normalize the data by the populations.
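As an illustration, normalizing by population can be as simple as the sketch below. The per-1M scale is an assumption chosen to match the "Tot Cases/1M pop" column of the Worldometer table, and the two rows are copied from that table.

```python
import pandas as pd

df = pd.DataFrame({
    "Country/Region": ["USA", "Brazil"],
    "Population": [3.314099e8, 2.128756e8],
    "TotalCases": [6788147, 4384299],
})

# Cases per one million inhabitants.
df["Cases/1M pop"] = df["TotalCases"] / df["Population"] * 1e6
print(df[["Country/Region", "Cases/1M pop"]].round(0))
```

The results agree with the reported 20483 and 20596 to within rounding. For regions, the same division would use the summed population of the member countries.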

Chosen Countries

I also want to compare some of the countries. To do so, I gather their info and follow the previous procedure.

Spotlight

Here I use the following countries:

['India', 'Brazil', 'USA', 'Italy', 'UK', 'Iceland', 'China', 'Poland']

Happiness Report group

When I first used the spotlight group, I had data only until the end of July. Poland did not seem to fit in the group, so to speak, so for reasons such as population size I decided to choose other countries for comparison. As I had done some visualisations on the Happiness Report before, including clustering on last year's data, I decided to reuse that here.

['Uruguay', 'Estonia', 'Czechia', 'Portugal', 'Slovenia', 'Panama', 'Poland']

Bubble plot

I want to compare how much of each case outcome category - fatal, recovered, mild and serious - is built of cases in specific regions.

Of course, each group is heavily dominated by patients in the Americas, who make up more than 50%. We can see that there are more deaths and serious cases in Europe than in other parts of the world. South-East Asia, in turn, has many mild cases and recovered patients.

Sankey plot

I use normalized data so that the Americas won't overpower the rest of the regions again. This time I look the other way around, at the structure within each region.
We can see that, compared with other regions, the Americas have a higher ratio of active to inactive cases. We can also notice that the death percentage is highest in Europe.
It appears as if some cases are lost somewhere along the way in Europe.

Bar plots - Best in each category

Normalized Tests

Once again, we see the 3 countries that cross the 1M/1M testing threshold. What is more, most of these countries are from Europe and there are none from Africa or South-East Asia.

Normalized cases

None from Africa or Western Pacific. Many from the Americas.

Total cases

Once again, no Western Pacific. We see the significant difference between first 3 countries and the rest.

Normalized deaths

South-East Asia seems to be coping pretty well. Yemen appears to be an outlier not only within its group: Sudan is the next country from that group, and it is much lower. There are many countries from Western Europe specifically.

Normalized recoveries

Most of these are small countries/regions.

Active cases

None from Africa. As expected, USA, India and Brazil are leading, with USA having significantly more active cases.

Area plots

Total cases structure

World

I am using day_wise data for this plot.

We can see that the percentage of active cases goes up and down for the second time now. The second rise started at the beginning of March, with its peak in early April.

Regions

  • The first thing that can and must be noticed is that the number of active cases varies drastically across the dataframes: in the Sankey plot active cases make up 25% of all cases so far, whereas here they make up 33% (Americas). As I want the insight into Mild and Serious cases in the Sankey and bubble plots, I cannot change the dataset used there.
  • In Western Pacific and Eastern Mediterranean the Deaths started before Recoveries.

Active cases over time - area plots

The pandemic reached Africa last. In most regions the number of active cases keeps growing. However, in Western Pacific this growth relative to population is the smallest.

Line plots

Active cases by time - world

As expected from the previous plot, the number of active cases keeps increasing.

Cases/population from 1st confirmed

Regions

Total cases

Not normalized

The number of confirmed cases so far is changing most drastically in Americas and South-East Asia.

Normalized:

The situation in Americas still looks the worst. However, relative to the region's population, the cases in Europe are the second worst, unlike in the previous plot.

Daily new cases

Not normalized

We can see a spurt in new cases for Western Pacific on day 22 for this region. Another large jump is noted on day 237 for South-East Asia. The jumps over the last few days are high.

Normalized

Peaks in Europe keep reaching the level of valleys in Americas. Western Pacific copes the best.

Chosen countries - spotlight

Total cases

Not normalized

Normalized

New cases

Not normalized

We see the reason for the sudden jump in new cases in South-East Asia.

Normalized

Chosen countries from group

Total cases

Not normalized

Normalized

New cases

Not normalized

Normalized

  • Worst situation is in Panama.
  • Portugal seemed to be improving, but the new cases number started increasing again.
  • Czechia's new cases increase rapidly.

Maps

I use ISO codes to map the data onto a globe map.

Country/Region code
0 Afghanistan AFG
1 Albania ALB
2 Algeria DZA
3 American Samoa ASM
4 Andorra AND
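Attaching the codes can be sketched as a left merge that also reports names without a match. The codes are copied from the table above; the case counts and the "Unlisted" row are made up for the example.

```python
import pandas as pd

codes = pd.DataFrame({
    "Country/Region": ["Afghanistan", "Albania", "Algeria", "American Samoa", "Andorra"],
    "code": ["AFG", "ALB", "DZA", "ASM", "AND"],
})
# Illustrative counts; "Unlisted" stands in for a name with no ISO code.
cases = pd.DataFrame({
    "Country/Region": ["Afghanistan", "Albania", "Unlisted"],
    "Active": [10, 20, 30],
})

# indicator=True adds a "_merge" column showing where each row came from.
merged = cases.merge(codes, on="Country/Region", how="left", indicator=True)
unmatched = merged.loc[merged["_merge"] == "left_only", "Country/Region"].tolist()
mapped = merged[merged["_merge"] == "both"].drop(columns="_merge")
```

Listing the unmatched names first is useful because country spellings often differ between datasets, and a silent miss leaves a blank spot on the map.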

Active cases over time

For a long time (second half of March) there are significantly more active cases in China than anywhere else. When the source of the outbreak starts to fade, we can see a spurt in Europe (several points) and the beginning of one in the US. The increases are not smooth. After that, the rest of the world starts to be affected as well. Cases in Africa are the last to be noticed.

Scatter plot

Tests and cases

The countries seem to be grouping.

Correlograms

I decided to use the Happiness Score data once again. The information about GDP, trust in the government, family relations, solidarity of the society and health might give us more insight into the pandemic situation.

Country/Region GDP family life expectancy freedom generosity trust
0 Finland 1.340 1.587 0.986 0.596 0.153 0.393
1 Denmark 1.383 1.573 0.996 0.592 0.252 0.410
2 Norway 1.488 1.582 1.028 0.603 0.271 0.341
3 Iceland 1.380 1.624 1.026 0.591 0.354 0.118
4 Netherlands 1.396 1.522 0.999 0.557 0.322 0.298

Whole world

We can see that countries with higher GDP can afford more tests for their populations.

Regions

I will be comparing only the variables pairs I have not considered before.

  • Americas: countries with higher GDP can afford more tests for their people.
  • South-East Asia: the more family-friendly the country, the fewer cases and deaths. Higher life expectancy means more recoveries.
  • Europe: the higher GDP, the more tests per population.
  • Africa: high correlation between deaths and cases.
  • Eastern Mediterranean: the higher GDP, the more cases.
  • Western Pacific: a more trusted government runs more tests.

Pie chart - Cases by months

Considering that my data ends in mid-September, it seems that the rise in new cases at this pace may not be as large as in the first months of the pandemic. It looks as if the growth was least concerning in May.

Comparison with other epidemics

I decided to compare the current pandemic with some of the prior ones. I chose the Ebola outbreak from 2014, swine flu from 2009 and SARS from 2003.

Import datasets

COVID

day_wise is used.

As I want to observe differences in the progress of each disease, I convert consecutive dates to days from the first record. I know that the outbreak started sometime in December or even earlier, so the record started about 2 months after the beginning of the disease.

Ebola

Country Date Cases Deaths
0 Guinea 2014-08-29 648.0 430.0
1 Nigeria 2014-08-29 19.0 7.0
2 Sierra Leone 2014-08-29 1026.0 422.0
3 Liberia 2014-08-29 1378.0 694.0
4 Sierra Leone 2014-09-05 1261.0 491.0

Some sources date the first cases as far back as December 2013. This means that the record started almost 9 months after the outbreak. Therefore, I will only be able to compare the reactions and results after people decided that it is a real threat.

Swine flu

Date Country Cases Deaths
0 2009-04-24 Mexico 18 0
1 2009-04-24 United States of America 7 0
2 2009-04-26 Mexico 18 0
3 2009-04-26 United States of America 20 0
4 2009-04-27 Canada 6 0

The swine flu was noticed pretty fast in comparison with Ebola, about 3 months after the first cases. However, as the WHO stopped requesting reports of individual cases, the data runs only to around the sixth month of the disease, which is before the point at which the Ebola record starts. Therefore, comparing the consecutive days after the outbreaks is impossible.

SARS 2003

Date Country Cases Deaths Recovered
0 2003-03-17 Germany 1 0 0
1 2003-03-17 Canada 8 2 0
2 2003-03-17 Singapore 20 0 0
3 2003-03-17 Hong Kong SAR, China 95 1 0
4 2003-03-17 Switzerland 2 0 0

The epidemic is said to have started in November of the year before. This once again means quite a large gap between the actual beginning and the beginning of the record.

Cumulative beginnings

Here I collect the data from the first days of the record for each epidemic.

    Cases        Date  Deaths  No. of countries  day   Epidemic
0   555.0  2020-01-22    17.0               6.0  0.0      COVID
0  3071.0  2014-08-29  1553.0               4.0  0.0      Ebola
0    25.0  2009-04-24     0.0               2.0  0.0  Swine flu
0   167.0  2003-03-17     4.0               7.0  0.0       SARS

Cumulative ailments

Here I collect the data from the first 70 days of the record for each epidemic - the swine flu dataset records only 73 days, so comparison of the last records in most cases would be wrong.

Cumulative 70 days

Here I collect the data for the mortality ratio on the 70th day of the record for each epidemic. This will allow me to compare the world's medical reactions to the diseases.

Last measurement

Alongside the previous mortality ratio, I will check if the world is learning how to deal with the serious cases.

          Cases        Date    Deaths   Epidemic  No. of countries    day  Mortality ratio
237  29570574.0  2020-09-15  934970.0      COVID             187.0  237.0         0.031618
258     28642.0  2016-03-23   11319.0      Ebola              12.0  572.0         0.395189
49      94451.0  2009-07-06     429.0  Swine flu             135.0   73.0         0.004542
95       8432.0  2003-07-11     813.0       SARS              31.0  116.0         0.096418
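The mortality ratio in this table is consistent with cumulative deaths divided by cumulative cases, which can be checked with a short sketch. The numbers are copied from the table above; the dataframe name `last` is an assumption.

```python
import pandas as pd

last = pd.DataFrame({
    "Epidemic": ["COVID", "Ebola", "Swine flu", "SARS"],
    "Cases": [29570574.0, 28642.0, 94451.0, 8432.0],
    "Deaths": [934970.0, 11319.0, 429.0, 813.0],
})

# Share of recorded cases that ended in death, per epidemic.
last["Mortality ratio"] = last["Deaths"] / last["Cases"]
print(last.round({"Mortality ratio": 6}))
```

Note that this is a ratio over recorded cases only, so it depends on how thoroughly each epidemic was tested and reported.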

Number of countries for start of the record

We can see that COVID was more widespread across the world at the start of the record than Ebola and swine flu. Moreover, H1N1 was noticed in only 2 countries: the USA and Mexico.

Number of cases for start of the record

It seems like Ebola was much more disregarded. However, it also seems that at this point, at the end of January, more cases of COVID should have been noted in China alone.

Deaths

COVID-19 killed significantly more people than the rest of the ailments in the first 70 days.

Cases

It is also much more contagious. However, we can already notice that the number of Ebola cases is small relative to its number of deaths. This suggests a high mortality ratio.

Mortality ratio

I check the mortality ratio after 70 days, assuming that in each case the world had the same time to assess the situation after realising the threat.
We see that the COVID-19 mortality ratio is much lower than that of Ebola or SARS. However, given the pace of contagion, the new pandemic should not be disregarded, as the number of deaths is still significantly higher.

We can see that the ratio for COVID-19 drops significantly. This may suggest progress in the situation, unlike for Ebola and SARS.